2. Introduction

The art of data exploration is looking at data, rapidly generating hypotheses, quickly testing them, and repeating again and again. The goal is to generate many promising leads to explore later.

3. Data Visualization

ggplot2 implements the “grammar of graphics”, a coherent system for describing and building graphs.

3.1.1 Prerequisites

Load the tidyverse:

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

3.2 First steps

Do cars with big engines use more fuel than cars with small engines? What does the relationship between engine size and fuel efficiency look like? Is it positive, negative, linear, nonlinear? Using the mpg data set from ggplot2:

mpg
## # A tibble: 234 × 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
## 1          audi         a4   1.8  1999     4   auto(l5)     f    18    29
## 2          audi         a4   1.8  1999     4 manual(m5)     f    21    29
## 3          audi         a4   2.0  2008     4 manual(m6)     f    20    31
## 4          audi         a4   2.0  2008     4   auto(av)     f    21    30
## 5          audi         a4   2.8  1999     6   auto(l5)     f    16    26
## 6          audi         a4   2.8  1999     6 manual(m5)     f    18    26
## 7          audi         a4   3.1  2008     6   auto(av)     f    18    27
## 8          audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
## 9          audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>

Check out the variables in the data frame:

names(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"        
##  [5] "cyl"          "trans"        "drv"          "cty"         
##  [9] "hwy"          "fl"           "class"

Note, displ is engine size in liters, hwy is fuel efficiency on the highway in mpg. Learn more about the data set:

?mpg

Plot hwy against displ (displ on the x-axis, hwy on the y-axis):

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# an example of a useless plot: both variables are categorical, so points pile up
# ggplot(data = mpg) +
#   geom_point(mapping = aes(x = class, y = drv))

The plot shows a negative relationship between displacement and fuel efficiency on the highway.

3.2.4 Exercises

2. How many rows and columns are in mpg?
# see a summary
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
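# count the rows (observations) and columns (variables)
num_rows <- nrow(mpg)
num_cols <- ncol(mpg)
print(paste("Number of rows in data set: ", num_rows, sep = ""))
## [1] "Number of rows in data set: 234"
print(paste("Number of columns in data set: ", num_cols, sep = ""))
## [1] "Number of columns in data set: 11"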

3.3 Aesthetic Mappings

Add a third variable to a 2D plot by mapping it to an ‘aesthetic’: a visual property of the points, such as their size, shape, or color. The particular values an aesthetic takes (a specific color, for example) are described as its “levels”. Can, for example, map the colors of points to the class variable to reveal the class of each car plotted:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Ah, the two-seaters are probably sports cars, which have smaller bodies and are likely to get better gas mileage than SUVs yet still have a large displacement.

Note, can also map a variable to the size aesthetic. This is not a good idea here, but to practice:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.

Also, the alpha aesthetic to control the transparency of the points:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

Or, the shape:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).

Make all the points blue (notice the color param is outside of aes):

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

When setting an aesthetic manually like that (outside of aes), choose a value that makes sense for it:

  • name of color as a string
  • size of a point in mm
  • shape of a point as a number (shapes can have a color and fill color)
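A minimal sketch of setting several aesthetics manually at once. The particular values are arbitrary choices for illustration, not from the notes above. Shape 21 is a filled circle, so it takes both an outline color and a fill color.

```r
library(ggplot2)

# All four aesthetics are set outside aes(), so they apply to every point
# instead of being mapped to a variable.
p_manual <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "darkblue",  # outline color, given by name as a string
    fill = "orange",     # fill color, only used by the filled shapes 21-25
    size = 3,            # point size in mm
    shape = 21           # filled circle
  )
p_manual
```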

3.3.1 Exercises

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

Check the data set again:

mpg
## # A tibble: 234 × 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
## 1          audi         a4   1.8  1999     4   auto(l5)     f    18    29
## 2          audi         a4   1.8  1999     4 manual(m5)     f    21    29
## 3          audi         a4   2.0  2008     4 manual(m6)     f    20    31
## 4          audi         a4   2.0  2008     4   auto(av)     f    21    30
## 5          audi         a4   2.8  1999     6   auto(l5)     f    16    26
## 6          audi         a4   2.8  1999     6 manual(m5)     f    18    26
## 7          audi         a4   3.1  2008     6   auto(av)     f    18    27
## 8          audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
## 9          audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>

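The type row under the column names shows this. The <chr> columns (manufacturer, model, trans, drv, fl, class) are categorical. The <dbl> and <int> columns (displ, year, cyl, cty, hwy) are numeric: displ, cty and hwy are continuous, while year and cyl take only a few distinct values and are often treated as categorical.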
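The same information can be pulled out programmatically. A small sketch using base R only; the variable names col_types, categorical and numeric_cols are made up for this example.

```r
library(ggplot2)

# First class of every column in mpg, as a named character vector.
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# Character columns are the categorical ones; integer and double columns are numeric.
categorical <- names(col_types)[col_types == "character"]
numeric_cols <- names(col_types)[col_types %in% c("integer", "numeric")]
categorical
numeric_cols
```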

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
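# ggplot(data = mpg) +
#   geom_point(mapping = aes(x = displ, y = hwy, shape = year))

Note the error: a continuous variable cannot be mapped to shape, because shapes have no natural order. Mapped to color, a continuous variable gives a gradient; mapped to size, it gives points of graduated size.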
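A sketch of all three cases side by side, assuming cty as the continuous variable (any numeric column would do). The shape case is wrapped in tryCatch so the script keeps running and the error message is kept as text.

```r
library(ggplot2)

# Color and size accept a continuous variable.
p_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cty))
p_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))

# Shape does not: building the plot raises an error, captured here as text.
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
shape_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
shape_result
```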

4. What happens if you map the same variable to multiple aesthetics?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).

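It works, but the information is redundant: color and shape both encode class. The shape palette still runs out after six values, so the rows of the seventh class are dropped, as the warnings say.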

5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
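?geom_point
ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    shape = 21, colour = "black", fill = "white", size = 5, stroke = 5
  )

stroke sets the width of a point's border, so it matters for shapes that have a border, such as shape 21 used here (one of the filled shapes 21-25).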

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?

Note, color mapped inside aes() (as opposed to color set as a parameter of geom_point) can follow a continuous variable:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ))

Can also map color to an expression such as displ < 5. ggplot2 evaluates it for every row and colors the points by the resulting TRUE/FALSE values.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
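To check what the expression mapping produces, one option is to inspect the built layer with layer_data(), which returns the data after the scales have been applied. A small sketch; p_cond and point_colours are names made up for this example.

```r
library(ggplot2)

p_cond <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

# The colour column holds one color for TRUE and one for FALSE.
point_colours <- unique(layer_data(p_cond)$colour)
length(point_colours)
```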

3.5 Facets

Add more variables to a plot with facets, which are particularly useful for categorical variables. Faceting with one variable means plotting two variables on the x/y axes, then splitting the plot into separate panels, one for each value of a third, categorical variable. The “formula” passed to facet_wrap is a ~ followed by a variable name (formula here is the name of an R data structure, not a synonym for equation), and that variable should be discrete:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~class, nrow = 2)

Use facet_grid to facet with two variables. The “formula” here is two variable names separated by a ~. First, review mpg data set variables again:

?mpg

Plot displacement against highway mpg, faceted by drv (front, rear or 4wd) and number of cylinders. The formula reads rows ~ columns. Note, better to put the variable with more unique values in the columns:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv~cyl)
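The rows-versus-columns choice above can be made from the data. A sketch under the assumption that "more unique values" is the only criterion; n_levels, rows_var, cols_var and facet_formula are names made up for this example.

```r
library(ggplot2)

# Number of distinct values in each candidate faceting variable.
n_levels <- vapply(
  mpg[c("drv", "cyl")],
  function(col) length(unique(col)),
  integer(1)
)
n_levels

# Put the variable with more distinct values in the columns (right of the ~).
cols_var <- names(which.max(n_levels))
rows_var <- setdiff(names(n_levels), cols_var)
facet_formula <- as.formula(paste(rows_var, "~", cols_var))

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(facet_formula)
```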

To control which way to facet (in rows or columns), use facet_grid with one variable and a period in place of the other. For columns:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(.~cyl)

For rows:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(cyl~.)

3.5.1 Exercises

Don’t facet on a continuous variable! Each distinct value gets its own facet, so the plot splits into many sparse panels:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(cty~.)
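If faceting on a numeric variable is really wanted, one possible workaround (not from the notes above) is to bin it first. The sketch below uses cut_width(), a ggplot2 helper; the bin width of 5 and the column name cty_bin are arbitrary choices for illustration.

```r
library(ggplot2)

# Bin the continuous variable into intervals of width 5 before faceting.
mpg_binned <- as.data.frame(mpg)
mpg_binned$cty_bin <- cut_width(mpg_binned$cty, width = 5)

# Far fewer panels than one per distinct cty value.
nlevels(mpg_binned$cty_bin)
length(unique(mpg$cty))

ggplot(data = mpg_binned) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(cty_bin ~ .)
```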

3.6 Geometric Objects

Use a different geom to make a different type of plot. So a scatter plot…

ggplot(data = mpg) +
  geom_point(mapping=aes(x=displ,y=hwy))

…can have a line fitted:

ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ,y=hwy)) +
  geom_smooth(mapping=aes(x=displ,y=hwy))
## `geom_smooth()` using method = 'loess'

Could use linetype aesthetic to draw unique line for each unique value of a variable. So for 4WD, front and rear wheel drive vehicles, displacement against highway fuel efficiency:

ggplot(data=mpg) +
  geom_smooth(mapping=aes(x=displ,y=hwy,linetype=drv))
## `geom_smooth()` using method = 'loess'

Try overlaying original data back on and coloring:

ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ,y=hwy,color=drv)) +
  geom_smooth(mapping=aes(x=displ,y=hwy,linetype=drv,color=drv))
## `geom_smooth()` using method = 'loess'

But notice that the mappings are repeated. Place the mapping in the main ggplot function to give it global scope for the plot. Each geom then picks up those aesthetics, and ggplot will do its best to apply them accordingly:

ggplot(data=mpg,mapping=aes(x=displ,y=hwy,linetype=drv,color=drv)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'

It’s necessary to make aesthetics local to a certain geom sometimes. For instance, a scatter plot can easily be given the third variable with the color aesthetic…

ggplot(data=mpg,mapping=aes(x=displ,y=hwy,color=class)) +
  geom_point()

…but fitting a curve is problematic, because geom_smooth inherits the color mapping and fits a separate curve for each class, and some classes have too few points for loess:

ggplot(data=mpg,mapping=aes(x=displ,y=hwy,color=class)) +
  # geom_point(mapping=aes(color=class)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 5.6935
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.5065
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.65044
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 5.6935
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.5065
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 0.65044
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 4.008
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.708
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.25
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 4.008
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.708
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 0.25

This can be remedied by making the color aesthetic local only to the point geom:

ggplot(data=mpg,mapping=aes(x=displ,y=hwy)) +
  geom_point(mapping=aes(color=class)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
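To check why the per-class smooths produce warnings, it helps to count the cars in each class. A small sketch; the threshold of 10 is an arbitrary cut-off chosen for illustration.

```r
library(ggplot2)

# Number of cars in each class. With color = class inherited by geom_smooth(),
# each class gets its own loess fit on this many points.
class_counts <- table(mpg$class)
sort(class_counts)

# The smallest classes are the likely source of the loess warnings.
small_classes <- names(class_counts)[class_counts < 10]
small_classes
```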

Note, the documentation for ggplot isn’t all-encompassing. Would need to visit the website to see all geoms, for example.

?ggplot

3.6.1 Exercises

The se argument of geom_smooth() controls whether the confidence band around the curve is drawn; se = FALSE hides it:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

Versus:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'

Can use show.legend to hide the legend:

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)
## `geom_smooth()` using method = 'loess'

3.7 Statistical Transformations

Bar chart of the diamonds dataset, with diamonds grouped by cut. The count on the y-axis is not a variable in the data: geom_bar computes it by counting the rows in each group. Computations like this are known as statistical transformations (stats):

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
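A sketch that makes the transformation visible: the same chart written with stat_count() in place of geom_bar(), plus a check that the computed counts match a plain tabulation. The names p_bar, p_stat, computed and tabulated are made up for this example.

```r
library(ggplot2)

p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# The same plot written with the stat function instead of the geom.
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

# The counts computed by the stat match a plain tabulation of the cut column.
computed <- layer_data(p_bar)$count
tabulated <- as.vector(table(diamonds$cut))
all(computed == tabulated)
```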

Can see the underlying statistical transformation function in the documentation. Sometimes it is necessary to override it manually. For example, if the data is already counted, set the stat to “identity”:

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

geom_col works like geom_bar except that it doesn’t compute anything: bar heights come straight from the y values in the data.

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_col(mapping = aes(x = cut, y = freq))

Can plot bar chart based on proportion by setting the y-axis to be the computed value from the statistical transformation:

?geom_bar # look through documentation to see computed values

Now plot it (note the aesthetic group = 1 in the bar layer: without it, proportions are computed within each cut, so every bar would have height 1):

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
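The effect of group = 1 can be checked numerically. This sketch assumes ggplot2 3.3.0 or later, where after_stat(prop) is the current spelling of ..prop..; on older versions keep the ..prop.. form.

```r
library(ggplot2)

# group = 1 puts all rows in one group, so prop is each cut's share of the total.
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Without group = 1, each cut is its own group and every proportion is 1.
p_ungrouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

sum(layer_data(p_grouped)$y)
unique(layer_data(p_ungrouped)$y)
```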

Can draw attention to the statistical transformation itself. For example, for each category, can plot the range from min to max with the median marked (all from the stat_summary() function in ggplot2):

ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    geom = "pointrange", # default geom
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )
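In ggplot2 3.3.0 and later, the arguments fun.ymin, fun.ymax and fun.y are deprecated in favor of fun.min, fun.max and fun. A sketch of the same summary with the newer names, assuming such a version is installed, plus a look at the values the stat computes:

```r
library(ggplot2)

p_summary <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

# One row per cut, with the computed ymin, y (the median) and ymax.
summary_values <- layer_data(p_summary)[, c("x", "ymin", "y", "ymax")]
summary_values
```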

There are over 20 stats to use. For example stat_bin:

?stat_bin

3.7.1 Exercises

Rewrite the pointrange plot using geom_line() for the range and stat_summary() for the median point (this gives control over the size and shape of the dot, for example):

ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
  geom_line() +
  stat_summary(fun.y = "median", geom = "point", size = 4.5)
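Another way to rewrite the plot with a geom function is to call geom_pointrange() directly and tell it to use the summary stat. A sketch, again assuming ggplot2 3.3.0 or later for the fun.min, fun.max and fun argument names:

```r
library(ggplot2)

# geom_pointrange() defaults to stat = "identity", so the summary stat
# and its summary functions have to be supplied explicitly.
p_pointrange <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
p_pointrange
```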

3.8 Position Adjustments

Color the bounding box of each bar in a bar chart based on the cut variable:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

Fill bars with color:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

Introduce a second variable to stack items in a single bin:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

Note, stacking is performed by the position argument (the default for bars is "stack"). position = "identity" places each bar exactly where it falls, starting from zero, so the bars overlap instead of stacking (note how ‘SI1’ disappears behind other bars in the ‘Ideal’ bar):

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "identity")

Can get around this with an alpha channel (but probably best to let ggplot order and stack):

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")

Or take away the fill and draw only the outlines to show where the bars are. Compare the two versions (these are two separate plots):

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")

ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + 
  geom_bar(fill = NA, position = "identity")

Use position argument to create a set of bars of same height that stack the proportion within each category.

First, the stacked bar chart again:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

Now, using the position argument to make each bar its own measurement of proportion:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
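A quick check that position = "fill" does what is described: after the adjustment, the stacked segments in every bar should reach exactly 1. A sketch using layer_data(); p_fill, built_fill and bar_tops are names made up for this example.

```r
library(ggplot2)

p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

# After the position adjustment, ymax is the top of each stacked segment.
# The top of the highest segment in every bar should be 1.
built_fill <- layer_data(p_fill)
bar_tops <- tapply(built_fill$ymax, as.integer(built_fill$x), max)
bar_tops
```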

Use position = "dodge" to place the overlapping objects directly beside one another:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

position = "jitter" is not useful for bar charts, but is for scatter plots. Helps avoid over-plotting (values plotted over each other). Remember the scatter plot for displacement and mpg on highway:

ggplot(data = mpg) +
  geom_point(mapping=aes(x=displ,y=hwy))

But only 126 points are visible here instead of the 234 observations in the dataset: the values are rounded, so many points land on top of each other. Can use jitter to add a little random noise to each point to spread them out a bit. This makes the plot less accurate at small scales, but more revealing at large scales.

ggplot(data = mpg) +
  geom_point(mapping=aes(x=displ,y=hwy), position = "jitter")
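The amount of overplotting can be measured by counting the distinct (displ, hwy) pairs, since each distinct pair is drawn as one visible point. The sketch below also shows position_jitter() with explicit settings; the width, height and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# Each distinct (displ, hwy) pair shows up as a single point without jitter.
n_visible <- nrow(unique(mpg[, c("displ", "hwy")]))
n_total <- nrow(mpg)
c(visible = n_visible, total = n_total)

# position_jitter() controls how much noise is added in each direction;
# fixing the seed makes the jittered plot repeatable.
ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.1, height = 0.5, seed = 42)
  )
```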

3.9 Coordinate Systems

Can flip the Cartesian coordinates to make horizontal boxplots:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

Use coord_quickmap() to set correct aspect ratio for maps:

nz <- map_data("nz")
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

Versus:

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

Use coord_polar() to turn a bar chart into a Coxcomb chart. Start with a flipped bar chart:

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()

And make into a Coxcomb:

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_polar()

The parameters in the below template compose the grammar of graphics, which means you can uniquely describe any plot as a combination of a dataset, geom, set of mappings, a stat, a position argument, a coordinate system and a faceting scheme.

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
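One way to fill in all seven parts named above, as a sketch. The particular choices (dodged bars of cut by clarity, flipped, faceted by color) are arbitrary and only meant to show where each part goes.

```r
library(ggplot2)

p_template <- ggplot(data = diamonds) +        # dataset
  geom_bar(                                    # geom
    mapping = aes(x = cut, fill = clarity),    # mappings
    stat = "count",                            # stat
    position = "dodge"                         # position adjustment
  ) +
  coord_flip() +                               # coordinate system
  facet_wrap(~ color)                          # faceting scheme
p_template
```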